Visualization in Python

Background

Why visualize?

  • Discovery
  • Inference
  • Communication

Terminology

  • Representation
    • Environment for visualization (e.g., 2d, 3d, sound)
  • Idiom
    • Constructs used (e.g., bar plot, area plot)
  • Task
    • What the user is trying to do (e.g., compare, predict, find relationships)
  • Design
    • Choice of the representation(s) and idiom(s) to perform the task

Question: What would be an effective way to visualize:

  • Average family income in US over the last 10 years?
  • Average family income by state in 2016?
  • Average family income by state over the last 10 years?

Software Engineering & Visualization

There are many python packages for visualization.

  • pandas – Visualization of pandas objects
  • matplotlib – MATLAB plotting in python
  • seaborn – Statistical visualizations
  • bokeh – Interactive visualization using the browser
  • HoloViews – Simplified visualization of engineering/scientific data
  • VisPy – fast, scalable, simple interactive scientific visualization
  • Altair – declarative statistical visualization

We'll begin with visualization in pandas and focus on matplotlib. There is great documentation on all of this. The case study is to analyze the flow of bicycles out of stations in the Pronto trip data. In this section, we'll discuss:

  • the structure of a matplotlib plot
  • different plot idioms
  • doing multiple plots

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
# The following ensures that the plots are in the notebook
%matplotlib inline
# We'll also use capabilities in numpy
import numpy as np

Analysis questions

  • Which stations have the biggest difference between in-flow and out-flow of bikes?
  • Where can we localize the movement of bicycles between stations that are in close proximity?

Preparing Data For Visualization

Much of the effort in visualizing data is in preparing the data for visualization. Typically, you'll want to use one or more pandas DataFrame.


In [2]:
df = pd.read_csv("2015_trip_data.csv")
df.head()


Out[2]:
trip_id starttime stoptime bikeid tripduration from_station_name to_station_name from_station_id to_station_id usertype gender birthyear
0 431 10/13/2014 10:31 10/13/2014 10:48 SEA00298 985.935 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Annual Member Male 1960
1 432 10/13/2014 10:32 10/13/2014 10:48 SEA00195 926.375 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Annual Member Male 1970
2 433 10/13/2014 10:33 10/13/2014 10:48 SEA00486 883.831 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Annual Member Female 1988
3 434 10/13/2014 10:34 10/13/2014 10:48 SEA00333 865.937 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Annual Member Female 1977
4 435 10/13/2014 10:34 10/13/2014 10:49 SEA00202 923.923 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Annual Member Male 1971

Suppose we want o analyze the flow of bicycles from and to stations.

Question: What data do we need for this visualization? How do we get it?


In [3]:
from_counts = pd.value_counts(df.from_station_id)
to_counts = pd.value_counts(df.to_station_id)

In [4]:
from_counts.head()


Out[4]:
WF-01     6742
BT-01     5885
CBD-13    5385
CH-07     5190
SLU-15    5006
Name: from_station_id, dtype: int64

In [5]:
type(from_counts)


Out[5]:
pandas.core.series.Series

In [6]:
to_counts.head()


Out[6]:
WF-01     7212
CBD-13    7189
BT-01     5800
SLU-07    5390
SLU-15    5328
Name: to_station_id, dtype: int64

Question: How we would get the same information using groupby?

Simple Plots for Series

Let's address the question "Which stations have the biggest difference between the in-flow and out-flow of bicycles?"

What kind of objects are returned from pd.value_counts? Are these plottable? How do we figure this out?


In [12]:
from_counts.plot.bar()


Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fd95218a110>

We can compare from and to counts with sidey-by-side plots. But to do this, we need a DataFrame with these counts.


In [13]:
df_counts = pd.DataFrame({'from':from_counts, 'to': to_counts})

In [16]:
df_counts.plot(kind='bar', subplots=True, grid=True, title="Counts",
        layout=(1,2), sharex=True, sharey=False, legend=False, figsize=(12, 8))


Out[16]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7fd953241410>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fd94c936510>]], dtype=object)

Question: How do we make the plots bigger?

But this plot doesn't tell us about the difference between "from" and "to" counts. We want to subtract to_counts from from_counts. Will this difference be plottable?


In [17]:
# What is the index for df_counts?
df_counts.head()


Out[17]:
from to
BT-01 5885 5800
BT-03 4199 3386
BT-04 2221 1856
BT-05 3368 3459
CBD-03 2974 3959

In [ ]:
(from_counts-to_counts).plot.bar()

Question: How do we get rid of the garbage data for the station "Pronto"?


In [22]:
df1 = df_counts[df_counts.index=='Pronto shop']
df1


Out[22]:
from to
Pronto shop 1 13

In [24]:
df_counts[df_counts.index!='Pronto shop'].plot.bar(figsize=(10,6))


Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fd947d7f750>

Some issues:

  • Bogus value 'Pronto shop'
  • Difficult to read the labels on the x-axis
  • The x and y axis aren't labelled
  • Lost information about "from" and "to"

Writing a Data Cleansing Function

We want to get rid of the row 'Pronto shop' in both from_counts and to_counts.


In [ ]:
# Selecting a row
from_counts[from_counts.index == 'Pronto shop']

In [ ]:
# Deleting a row
new_from_counts = from_counts[from_counts.index != 'Pronto shop']
new_from_counts.plot.bar()

In [ ]:
def simple_clean_rows(df):
    """
    Removes from df all rows with the specified indexes
    :param pd.DataFrame or pd.Series df:
    :return pd.DataFrame or pd.Series:
    """
   
    df = df[df.index != 'Pronto Shop']
    return df

In [ ]:
def clean_rows(df, indexes):
    """
    Removes from df all rows with the specified indexes
    :param pd.DataFrame or pd.Series df:
    :param list-of-str indexes
    :return pd.DataFrame or pd.Series:
    """
    for idx in indexes:
        df = df[df.index != idx]
    return df

In [ ]:
dff = clean_rows(to_counts, ['Pronto Shop', 'CBD-13'])
dff.plot.bar()

Does clean_rows need to return df to effect the change in df?


In [ ]:
to_counts = clean_rows(to_counts, ['Pronto shop'])
to_counts.plot.bar()

In [ ]:
from_counts = clean_rows(from_counts, ['Pronto shop'])
from_counts.plot.bar()

In [ ]:
to_counts.head()

Getting More Control Over Plots

Let's take a more detailed approach to plotting so we can better control what gets rendered.

In this section, we show how to control various elements of plots to produce a desired visualization. We'll use the package matplotlib, a python package that is modelled after MATLAB style plotting.

Make a dataframe out of the count data.


In [ ]:
df_counts = pd.DataFrame({'From': from_counts.sort_index(), 'To': to_counts.sort_index()})

Need to align the counts by the station. Do we do this?


In [ ]:
df_counts.head()

In [ ]:
"""
Basic bar chart using matplotlib
"""
n_groups = len(df_counts.index)
index = np.arange(n_groups)  # The "raw" x-axis of the bar plot

fig = plt.figure(figsize=(12, 8))  # Controls global properties of the bar plot
rects1 = plt.bar(index, df_counts.From)
plt.xlabel('Station')
plt.ylabel('Counts')
plt.xticks(index, df_counts.index)  # Convert "raw" x-axis into labels
_, labels = plt.xticks()  # Get the new labels of the plot
plt.setp(labels, rotation=90)  # Rotate labels to make them readable
plt.title('Station Counts')
plt.show()

Issue - much more code, which will tend to be copied and pasted.

Solution - MAKE A FUNCTION NOW!!!


In [ ]:
def plot_bar1(df, column, opts):
    """
    Does a bar plot for a single column.
    :param pd.DataFrame df:
    :param str column: name of the column to plot
    :param dict opts: key is plot attribute
    """
    n_groups = len(df.index)
    index = np.arange(n_groups)  # The "raw" x-axis of the bar plot
    rects1 = plt.bar(index, df[column])
    if 'xlabel' in opts:
      plt.xlabel(opts['xlabel'])
    if 'ylabel' in opts:
      plt.ylabel(opts['ylabel'])
    if 'xticks' in opts and opts['xticks']:
      plt.xticks(index, df.index)  # Convert "raw" x-axis into labels
      _, labels = plt.xticks()  # Get the new labels of the plot
      plt.setp(labels, rotation=90)  # Rotate labels to make them readable
    else:
      labels = ['' for x in df.index]
      plt.xticks(index, labels)   
    if 'ylim' in opts:
      plt.ylim(opts['ylim'])
    if 'title' in opts:
      plt.title(opts['title'])

In [ ]:
fig = plt.figure(figsize=(12, 8))  # Controls global properties of the bar plot
opts = {'xlabel': 'Stations', 'ylabel': 'Counts', 'xticks': True, 'title': 'A Title'}
plot_bar1(df_counts, 'To', opts)

Comparisons Using Subplots

We want to encapsulate the plotting of N variables into a function. We could re-write plot_bar1. But other plots use this. Besides plot_bar1 is pretty good at handling a single plot. So, instead we use plot_bar1 in a new function.


In [ ]:
def plot_barN(df, columns, opts):
    """
    Does a bar plot for a single column.
    :param pd.DataFrame df:
    :param list-of-str columns: names of the column to plot
    :param dict opts: key is plot attribute
    """
    num_columns = len(columns)
    local_opts = dict(opts)  # Make a deep copy of the object
    idx = 0
    for column in columns:
        idx += 1
        local_opts['xticks'] = False
        local_opts['xlabel'] = ''
        if idx == num_columns:
          local_opts['xticks'] = True
          local_opts['xlabel'] = opts['xlabel']
        plt.subplot(num_columns, 1, idx)
        plot_bar1(df, column, local_opts)

In [ ]:
fig = plt.figure(figsize=(12, 8))  # Controls global properties of the bar plot
opts = {'xlabel': 'Stations', 'ylabel': 'Counts', 'ylim': [0, 8000]}
plot_barN(df_counts, ['To', 'From'], opts)

Question: How write tests for plot_barN?

Exercise

  • Extend the plot_barN to also plot pair-wise differences between plots. Have titles for all plots.

Including Error Bars in a Bar Chart

To make decisions about the truck trips required to adjust bikes at stations, we need to know the variations by day.

Want a bar plot with average daily "to" and "from" with their standard deviations.

Data Preparation

Need to:

  • Create day-of-year column for 'from' and 'to'
  • Compute counts by date
  • Compute the mean and standard deviation of the counts by date

(Assumes that a station has at least one rental every day.)


In [ ]:
df.head()

Let's start with the values for starttime. What type are these?


In [ ]:
print (df.starttime[0])
print (type(df.starttime[0]))

Question: How do we extract the day from a string?

YOU DON'T!!! You convert it to a datetime object.


In [ ]:
this_datetime = pd.to_datetime(df.starttime[0])
print this_datetime

In [ ]:
this_datetime.dayofyear

In [ ]:
start_day = []
for time in df.starttime:
    start_day.append(pd.to_datetime(time).dayofyear)

In [ ]:
start_day[2]

In [ ]:
start_day = [pd.to_datetime(time).dayofyear for time in df.starttime]
stop_day = [pd.to_datetime(x).dayofyear for x in df.stoptime]

In [ ]:
df['startday'] = start_day  # Creates a new column named 'startday'
df['stopday'] = stop_day

In [ ]:
df.head()

In [ ]:
groupby_day_from = df.groupby(['from_station_id', 'startday']).size()
groupby_day_from.head()

In [ ]:
groupby_day_to = df.groupby(['to_station_id', 'stopday']).size()
groupby_day_to.head()

Now we need to compute the average value and its standard deviation across the days for each station. The groupby produced a MultiIndex. So, further operations on the result must take this into account.


In [ ]:
h_index = groupby_day_from.index
h_index.levshape  # Size of the components of the MultiIndex

In [ ]:
from_means = groupby_day_from.groupby(level=[0]).mean()  # Computes the mean of counts by day
from_stds = groupby_day_from.groupby(level=[0]).std()   # Computes the standard deviation

In [ ]:
groupby_day_to = df.groupby(['to_station_id', 'startday']).size()
to_means = groupby_day_to.groupby(level=[0]).mean()  # Computes the mean of counts by day
to_stds = groupby_day_to.groupby(level=[0]).std()   # Computes the standard deviation

In [ ]:
df_day_counts = pd.DataFrame({'from_mean': from_means, 'from_std': from_stds, 'to_mean': to_means, 'to_std': to_stds})
df_day_counts.head()

Plotting with Error Bars


In [ ]:
"""
Plotting two variables as a bar chart with error bars
"""
n_groups = len(df_day_counts.index)
index = np.arange(n_groups)  # The "raw" x-axis of the bar plot
fig = plt.figure(figsize=(12, 8))  # Controls global properties of the bar plot
bar_width = 0.35  # Width of the bars
opacity = 0.6  # How transparent the bars are

#VVVV Changed to do two plots with error bars
error_config = {'ecolor': '0.3'}
rects1 = plt.bar(index, df_day_counts.from_mean, bar_width,
                 alpha=opacity,
                 color='b',
                 yerr=df_day_counts.from_std,
                 error_kw=error_config,
                 label='From')
rects2 = plt.bar(index + bar_width, df_day_counts.to_mean, bar_width,
                 alpha=opacity,
                 color='r',
                 yerr=df_day_counts.to_std,
                 error_kw=error_config,
                 label='to')
#^^^^ Changed to do two plots with error bars

plt.xticks(index + bar_width / 2, df_counts.index)
_, labels = plt.xticks()  # Get the new labels of the plot
plt.setp(labels, rotation=90)  # Rotate labels to make them readable
plt.legend()

plt.xlabel('Station')
plt.ylabel('Counts')
plt.title('Station Counts')
plt.show()

In-class exercise

Change the above script for plotting with error bars into a function and verify that you can call this function and get the same plot as the one above.

  • What are the inputs to your function and why?
  • How would you change plot_barN to use this function?